feat(auto-routing): benchmark-driven decision engine and kilo-auto/efficient by iscekic · Pull Request #3982 · Kilo-Org/cloud

iscekic · 2026-06-11T22:05:47Z

Benchmark-driven decision engine and `kilo-auto/efficient`

Summary

Adds a benchmark-driven model-routing pipeline behind a new hidden virtual model, kilo-auto/efficient. It routes each request to the cheapest model that our own benchmarks have shown to be accurate enough for the request's difficulty.

Three moving parts:

services/auto-routing-benchmark (new Cloudflare Worker): runs two deterministic benchmarks — classifier prompt replay via OpenRouter, and decider golden tasks through the real kilo CLI in a Cloudflare Container — writes normalized results to D1, and publishes a routing table (per-difficulty-tier ranked candidates) plus a classifier winner.
services/auto-routing (existing worker): /decide classifies the request, derives a difficulty tier, and picks the cheapest above-threshold model from the routing table, with session-sticky decisions held in a Durable Object.
apps/web (gateway): exposes kilo-auto/efficient, blocks on /decide with a 2s timeout, falls back to the balanced Qwen default, bills the classifier LLM cost to the requesting user, and adds an admin panel for the whole pipeline.

Shared classifier code (prompt, parsing, fallback, taxonomy, tier derivation, routing-table schema) moves into the new packages/auto-routing-contracts package, so the benchmark replays exactly the code the production worker executes.

Architecture

client request (model = kilo-auto/efficient)
        │
        ▼
┌────────────────────────────┐  POST /decide (2s timeout;      ┌─────────────────────────────────┐
│ apps/web gateway           │  null/timeout → balanced Qwen)  │ services/auto-routing           │
│ · resolves kilo-auto/*     │ ───────────────────────────────▶│ · classify request (LLM)        │
│ · applies pinned           │                                 │ · derive difficulty tier        │
│   reasoningEffort          │                                 │ · cheapest above-threshold pick │
│ · bills classifier cost    │                                 │ · sticky decision per           │
│ · admin panel              │                                 │   conversation (DO)             │
└──────────┬─────────────────┘                                 └──────────────┬──────────────────┘
           │ admin proxy +                                     routing table & classifier winner:
           │ 6h token mint                                     isolate cache 60s → KV 1h →
           │ (internal secret)                                 service binding origin
           ▼                                                                  ▼
┌───────────────────────────────────────────────────────────────────────────────────────────────┐
│ services/auto-routing-benchmark (new worker)                                                   │
│ · /admin/config /admin/runs /admin/routing-table /admin/classifier-winner /admin/debug-cli     │
│ · Queue (+DLQ) fans out per-model jobs                                                         │
│ · classifier bench: 72-case prompt replay via OpenRouter                                       │
│ · decider bench: 76 golden tasks via `kilo` CLI in a Cloudflare Container                      │
│ · D1 (drizzle, normalized): runs, case results, summaries, published routing tables            │
└───────────────────────────────────────────────────────────────────────────────────────────────┘

Benchmark worker (`services/auto-routing-benchmark`)

Classifier benchmark

Replays 72 normalized classifier inputs through OpenRouter using the exact production classifier code (@kilocode/auto-routing-contracts/classifier). Each output is graded per-field against a hand-labeled expectation via CLASSIFIER_FIELD_WEIGHTS (src/grading.ts): taskType 0.25, reasoningComplexity 0.20, contextComplexity 0.15, executionMode 0.15, subtaskType 0.10, requiresTools 0.10, riskLevel 0.05. Heuristic-fallback outputs score 0.
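A minimal sketch of the weighted per-field grading described above. The weights are the ones listed in this description; the `Classification` shape, the `gradeClassification` name, and the strict-equality comparison are assumptions made for illustration and may differ from src/grading.ts.

```typescript
// Weights as listed in the PR description; they sum to 1.0.
const CLASSIFIER_FIELD_WEIGHTS = {
  taskType: 0.25,
  reasoningComplexity: 0.2,
  contextComplexity: 0.15,
  executionMode: 0.15,
  subtaskType: 0.1,
  requiresTools: 0.1,
  riskLevel: 0.05,
} as const;

type ClassifierField = keyof typeof CLASSIFIER_FIELD_WEIGHTS;

// Hypothetical shape; the real classification type lives in the contracts package.
type Classification = Record<ClassifierField, string | boolean>;

// Sums the weight of every field that matches the hand-labeled expectation.
// Outputs produced by the heuristic fallback score 0 regardless of content.
function gradeClassification(
  expected: Classification,
  actual: Classification,
  usedHeuristicFallback: boolean,
): number {
  if (usedHeuristicFallback) return 0;
  let score = 0;
  for (const field of Object.keys(CLASSIFIER_FIELD_WEIGHTS) as ClassifierField[]) {
    if (expected[field] === actual[field]) score += CLASSIFIER_FIELD_WEIGHTS[field];
  }
  return score;
}
```

Under these assumptions, an output that gets only taskType wrong scores roughly 0.75, since taskType carries a quarter of the total weight.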

The winner (src/winner.ts) is the cheapest model meeting the run's accuracy threshold (most accurate one if none do). It feeds the worker's classifier-model resolution chain (below).
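The selection rule can be sketched as follows. The `ClassifierSummary` shape and the `pickWinner` name are hypothetical, and treating "meeting the threshold" as `>=` is an assumption; this is not the code in src/winner.ts.

```typescript
// Hypothetical per-model aggregate for one classifier run.
interface ClassifierSummary {
  model: string;
  accuracy: number; // 0..1, weighted per-field score averaged over cases
  costUsd: number; // total cost of the run for this model
}

// Cheapest model at or above the threshold; if none qualifies,
// fall back to the most accurate model. Returns null for an empty run.
function pickWinner(
  summaries: ClassifierSummary[],
  minAccuracy: number,
): ClassifierSummary | null {
  if (summaries.length === 0) return null;
  const passing = summaries.filter((s) => s.accuracy >= minAccuracy);
  if (passing.length > 0) {
    return passing.reduce((best, s) => (s.costUsd < best.costUsd ? s : best));
  }
  return summaries.reduce((best, s) => (s.accuracy > best.accuracy ? s : best));
}
```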

Decider benchmark

Runs 76 golden tasks per candidate model through the real kilo CLI (@kilocode/cli) inside a Cloudflare Container (container/Dockerfile + container/server.mjs, node:22-slim, standard-2). Grading is purely mechanical — exact / contains_all / regex / json_equal checks, no LLM judges; golden answers were hand-derived and mechanically re-verified where executable. Cases include genuinely agentic tasks performed with file/terminal tools inside the container (deterministic: no repo, no network).
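For illustration, the four mechanical check kinds might be graded along these lines. The `Check` union and function names are hypothetical, and the key-order-insensitive comparison for json_equal is an assumption. The last-non-empty-line rule for exact checks follows the reviewer note in this description.

```typescript
// Hypothetical check shapes; the real definitions live in the benchmark worker.
type Check =
  | { kind: "exact"; expected: string }
  | { kind: "contains_all"; expected: string[] }
  | { kind: "regex"; pattern: string }
  | { kind: "json_equal"; expected: unknown };

function lastNonEmptyLine(text: string): string {
  let last = "";
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed.length > 0) last = trimmed;
  }
  return last;
}

// Serializes a JSON value with sorted object keys so key order does not matter.
function canonical(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map((v) => canonical(v)).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return `{${keys.map((k) => `${JSON.stringify(k)}:${canonical(obj[k])}`).join(",")}}`;
  }
  return JSON.stringify(value);
}

function grade(check: Check, output: string): boolean {
  const trimmed = output.trim();
  switch (check.kind) {
    case "exact":
      // Accept the whole output or just its last non-empty line (preamble tolerance).
      return trimmed === check.expected || lastNonEmptyLine(output) === check.expected;
    case "contains_all":
      return check.expected.every((needle) => output.includes(needle));
    case "regex":
      return new RegExp(check.pattern).test(output);
    case "json_equal":
      try {
        return canonical(JSON.parse(trimmed)) === canonical(check.expected);
      } catch {
        return false; // unparseable output fails the check
      }
  }
}
```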

Execution details:

One queue message per (model, 10-case chunk); each chunk gets its own container instance (runId:model:chunk) so models/chunks never share state. CLI runs are serialized per instance (the CLI's sqlite state is not safe under concurrent first runs); a /warmup endpoint absorbs the one-time sqlite migration before the case loop.
Each candidate's pinned reasoningEffort is forwarded as the CLI's --variant, so the benchmark measures the model exactly as it will be served.
The CLI authenticates as a real Kilo user: the worker mints a short-lived token once per queue message via apps/web's internal endpoint (token only ever lives in a child-process env var, never logged or written to disk).
Empty-output sessions (exit 0, no assistant text) are retried once, mirroring the production classifier's retry policy; costs of both attempts are summed.
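The fan-out in the first bullet could look roughly like this. The message shape and the `sketchDeciderMessages` name are hypothetical; only the per-(model, chunk) granularity and the runId:model:chunk instance id come from the description above.

```typescript
// Hypothetical queue message shape for one (model, chunk) unit of work.
interface DeciderMessage {
  runId: string;
  model: string;
  chunk: number;
  caseIds: string[];
  containerInstance: string; // runId:model:chunk, so instances never share state
}

// Produces one message per (model, chunk) for the given chunk size.
function sketchDeciderMessages(
  runId: string,
  models: string[],
  caseIds: string[],
  chunkSize: number,
): DeciderMessage[] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new Error("chunkSize must be a positive integer");
  }
  const messages: DeciderMessage[] = [];
  for (const model of models) {
    let chunk = 0;
    for (let start = 0; start < caseIds.length; start += chunkSize) {
      messages.push({
        runId,
        model,
        chunk,
        caseIds: caseIds.slice(start, start + chunkSize),
        containerInstance: `${runId}:${model}:${chunk}`,
      });
      chunk += 1;
    }
  }
  return messages;
}
```

With 76 cases and 10-case chunks, each model yields 8 messages, the last one carrying the remaining 6 cases.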

Datasets

Both datasets cover all 18 (taskType, subtaskType) taxonomy pairs with at least 4 cases per pair — enforced by tests (src/datasets/*.test.ts). Decider cases each carry exactly one difficulty tier with at least 4 distinct task types per tier.

D1 schema (`src/db-schema.ts`)

Fully normalized, zero JSON blob columns, composite-PK-only access:

benchmark_config + config_classifier_models + config_decider_models — admin config (incl. per-decider-model reasoning_effort).
benchmark_runs — carries a config snapshot (min_accuracy, switch_cost_factor, max_concurrency, benchmark_user_id) taken at startRun time, so mid-run admin edits can't skew results. All job processing and publishing reads the snapshot, never live config.
run_models — which models were enqueued vs. skipped, with the pinned reasoning_effort snapshot.
case_results — per (run, model, case) score/latency/cost plus diagnostics (classifier fallback reason, CLI exit code/output prefix/event tail).
model_summaries — per (run, model, tier) aggregates. Carried summaries: models with prior results are skipped on new runs (their latest summaries are copied in with carried=true), so re-runs only spend on new candidates; the admin can force a full re-run.
routing_tables + routing_table_candidates — published tables, queryable history.

Single squashed baseline migration (migrations/0000_amused_shard.sql), applied by a predeploy script (wrangler d1 migrations apply --remote) which the CI deploy workflow now runs for any worker that defines one (.github/workflows/deploy-workers.yml).

Publishing

On run completion the worker builds the routing table from the run's own snapshot (src/routing-table-builder.ts): per tier, candidates are ranked best-bang-for-buck (above-threshold cheapest-first, below-threshold by accuracy). Models with zero graded cases or no cost signal in a tier are excluded; if any tier ends up empty the publish is skipped and the previous table stays live (schema enforces .min(1) per tier). Publishing only deletes the KV cache keys so the auto-routing worker repopulates from D1 on the next read.
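A sketch of the per-tier ranking and the skip-on-empty-tier rule, under stated assumptions: the `TierSummary` shape and function names are hypothetical, above-threshold candidates are placed ahead of below-threshold ones, and "by accuracy" is read as descending. The tier names follow the low / medium / high split mentioned in the commit messages.

```typescript
type Tier = "low" | "medium" | "high";
const TIERS: Tier[] = ["low", "medium", "high"];

// Hypothetical per-(model, tier) aggregate.
interface TierSummary {
  model: string;
  accuracy: number;
  gradedCases: number;
  avgCostUsd: number | null; // null when the run produced no cost signal
}

// Above-threshold candidates cheapest-first, then below-threshold by accuracy.
function rankTier(summaries: TierSummary[], minAccuracy: number): string[] {
  const eligible = summaries.filter(
    (s): s is TierSummary & { avgCostUsd: number } =>
      s.gradedCases > 0 && s.avgCostUsd !== null,
  );
  const above = eligible
    .filter((s) => s.accuracy >= minAccuracy)
    .sort((a, b) => a.avgCostUsd - b.avgCostUsd);
  const below = eligible
    .filter((s) => s.accuracy < minAccuracy)
    .sort((a, b) => b.accuracy - a.accuracy);
  return [...above, ...below].map((s) => s.model);
}

// Returns null when any tier ends up empty, meaning "skip the publish".
function buildRoutingTable(
  summaries: Record<Tier, TierSummary[]>,
  minAccuracy: number,
): Record<Tier, string[]> | null {
  const table: Record<Tier, string[]> = { low: [], medium: [], high: [] };
  for (const tier of TIERS) {
    const ranked = rankTier(summaries[tier], minAccuracy);
    if (ranked.length === 0) return null;
    table[tier] = ranked;
  }
  return table;
}
```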

Decision engine (`services/auto-routing`)

/decide (existing endpoint, now decision-capable):

Classifies the request (per-conversation classification cache in a Durable Object, keyed by classifier model + content hash).
Derives a difficulty tier (deriveDifficultyTier in contracts: reasoning complexity dominates at 2x weight; context, execution mode, and risk nudge borderline cases).
Picks from the routing table (src/decision-engine.ts): cheapest above-threshold candidate for the tier — unless the session has an incumbent.

Session stickiness: the conversation's Durable Object remembers the last served model. The incumbent is kept while it still meets the tier's accuracy threshold, unless the fresh pick is cheaper by more than the table's switchCostFactor. Rationale (commented in code): a model switch discards the provider's prompt cache, and rebuilding it costs full-price input tokens (4–10x cache-read rates) on a context that dominates agent-session spend — switching only pays off when recurring per-turn savings clearly exceed that one-time penalty. Sticky state trusts only real classifier output: heuristic fallbacks never re-anchor the session's model.
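The stickiness rule can be sketched as below. Everything here is an assumption layered on the prose: the `Candidate` shape, the reading of "cheaper by more than switchCostFactor" as `fresh.cost * factor < incumbent.cost`, returning null when no candidate meets the threshold, and setting `sticky` only when the incumbent overrides a different fresh pick. The real logic is in src/decision-engine.ts.

```typescript
// Hypothetical routing-table entry for one tier.
interface Candidate {
  model: string;
  accuracy: number;
  costUsd: number;
}

interface Decision {
  model: string;
  sticky: boolean;
}

function decideModel(
  candidates: Candidate[],
  minAccuracy: number,
  switchCostFactor: number,
  incumbentModel: string | null,
): Decision | null {
  const passing = candidates.filter((c) => c.accuracy >= minAccuracy);
  if (passing.length === 0) return null; // simplification: no decision

  // Fresh pick: cheapest candidate that meets the tier's threshold.
  const fresh = passing.reduce((best, c) => (c.costUsd < best.costUsd ? c : best));

  // The incumbent only counts while it still meets the threshold.
  const incumbent = passing.find((c) => c.model === incumbentModel);
  if (incumbent !== undefined && incumbent.model !== fresh.model) {
    // Switch only when the fresh pick is cheaper by more than the factor,
    // since a switch throws away the provider's prompt cache.
    const clearlyCheaper = fresh.costUsd * switchCostFactor < incumbent.costUsd;
    if (!clearlyCheaper) return { model: incumbent.model, sticky: true };
  }
  return { model: fresh.model, sticky: false };
}
```

With a factor of 3, an incumbent costing 3.0 is kept over a fresh pick costing 1.5, but dropped for one costing 0.5.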

Routing table access: read-through chain — isolate-local 60s TTL cache → KV (1h TTL, shared AUTO_ROUTING_CONFIG namespace) → service binding to the benchmark worker's D1-backed /admin/routing-table. Corrupt KV values are treated as misses; origin failures degrade to null (no decision) rather than erroring the request.

Classifier model resolution (src/classifier-config.ts): admin KV override → benchmark winner (same KV read-through, derived on read) → built-in default google/gemini-2.5-flash-lite. A benchmark-origin failure never discards a healthy override.
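The resolution order reads naturally as a short fallback chain. This sketch is synchronous for brevity (the real reads are KV and service-binding calls), the `resolveClassifierModel` name and `loadWinner` callback are hypothetical, and falling through to the built-in default when the winner lookup throws is an assumption.

```typescript
const DEFAULT_CLASSIFIER_MODEL = "google/gemini-2.5-flash-lite";

// adminOverride: value of the admin KV override, or null when unset.
// loadWinner: hypothetical stand-in for the benchmark-winner read-through;
// it may return null (no winner published) or throw (origin failure).
function resolveClassifierModel(
  adminOverride: string | null,
  loadWinner: () => string | null,
): string {
  // A healthy override wins outright, so a failing origin cannot discard it.
  if (adminOverride) return adminOverride;
  try {
    return loadWinner() ?? DEFAULT_CLASSIFIER_MODEL;
  } catch {
    return DEFAULT_CLASSIFIER_MODEL;
  }
}
```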

Gateway (`apps/web`)

kilo-auto/efficient (src/lib/ai-gateway/auto-model/index.ts): hidden virtual model (excluded from /models, usable by id) with the same catalog properties as balanced — intended to eventually replace it, hidden while validated on Kilo team traffic.
Resolution (auto-model/resolution.ts + auto-routing-decision.ts): blocks on /decide with a 2s timeout; on a decision, serves the decided model and applies its pinned reasoningEffort so it runs under the same conditions the benchmark measured. On null/timeout/error, serves BALANCED_QWEN_MODEL — an efficient request never degrades below balanced.
Billing: the classifier LLM cost returned by /decide is billed to the requesting user as a separate microdollar usage row (requested_model: kilo-auto/efficient), so routing overhead is visible and attributed rather than absorbed.
Admin panel (admin/auto-routing/BenchmarksSection.tsx, proxied through admin API routes with the internal secret): config editor (classifier/decider model lists, per-decider reasoningEffort, minAccuracy, switchCostFactor, maxConcurrency, benchmarkUserId), run triggers with a force-rerun toggle, run history, and the live published routing table.
Config-save validation (admin/api/auto-routing/benchmark-config/route.ts): every decider model must be servable on all gateway chat API kinds (chat_completions, responses, messages) by the provider the gateway would route it to — the routing table deliberately carries no per-protocol metadata, so this invariant is enforced at write time.
Token mint (api/internal/auto-routing-benchmark/token/route.ts): POST gated by INTERNAL_API_SECRET; mints a 6h full user API token (tokenSource: auto-routing-benchmark) for the decider CLI's identity/billing.

Design properties

No fabricated data anywhere. There is no default routing table: /decide returns null decisions until a benchmark publishes one, and the gateway serves balanced fallbacks. There is no default benchmark config: runs refuse to start until an admin saves one (and decider runs additionally fail fast without a benchmarkUserId).
Deterministic, reproducible grading. Mechanical checks only; run-level config snapshots; routing tables built from the run's snapshot, not live config.
Cheap iteration. Carried summaries mean adding one candidate model re-benchmarks only that model; config-only changes (model removed, threshold tweaked) republish instantly with zero spend.
Graceful degradation at every layer. Corrupt KV → miss; origin failure → previous behavior; classifier failure → no decision → balanced fallback; publish failure → previous table stays live.

Infrastructure

D1 auto-routing-benchmark in region EEUR, primary in Frankfurt (colo FRA — next to the backend; verified via wrangler d1 info).
Queue auto-routing-benchmark-jobs (max_concurrency 4, max_retries 2) + DLQ auto-routing-benchmark-dlq.
Container app auto-routing-benchmark-runner (standard-2, max 40 instances), image built and pushed by wrangler deploy.
Service binding auto-routing → auto-routing-benchmark; shared KV namespace AUTO_ROUTING_CONFIG.
Both workers already run this branch's code; the D1 database is empty pending admin setup.

Post-merge deploy / cutover checklist

Merge → Vercel ships the gateway side (kilo-auto/efficient, admin panel, token mint).
First post-merge worker deploy runs the D1 migration via the new CI predeploy hook — CI's CLOUDFLARE_API_TOKEN needs D1 edit permission (the deploy will surface it if missing).
Admin saves a benchmark config: benchmarkUserId is required for decider runs (consider a dedicated service account — its account is billed for CLI usage); suggested switchCostFactor starting value: 3.
Trigger a classifier run and a decider run from the admin panel.
Clear the leftover classifier_model KV override (currently set to flash-lite) if the benchmark winner should drive classifier selection.

Reviewer notes

The exact decider check also accepts the last non-empty output line (src/grading.ts): agent harnesses sometimes prepend preamble despite instructions; wrong answers fail either way.
@kilocode/cli@latest is resolved at image build time, i.e. each deploy pins whatever was latest then; re-deploy to pick up a newer CLI.
The token-mint endpoint is gated by INTERNAL_API_SECRET and can mint for any user id; scoping it to the configured benchmark user is a reasonable follow-up.
The decider benchmark exercises models through chat_completions only (the CLI's path). Config-save validation guarantees candidates are servable on all three chat API kinds, but accuracy is only measured on one.

Mints a short-lived (6h) user API token for a given userId, guarded by the shared internal secret over Authorization: Bearer. The decider benchmark uses this to authenticate the kilo CLI against the gateway under a real user's identity.

… container

The decider benchmark now executes each case through the stable kilo CLI (@kilocode/cli) running in a Cloudflare Container, instead of bare OpenRouter chat completions, so it measures the real agent harness.

- Container (Dockerfile + dependency-free server.mjs) spawns `kilo run --format json --auto` per case; the kilo user token is injected only as a child-process env var, never logged or written to disk.
- BenchRunnerContainer DO + wrangler containers/durable_objects/migrations.
- kilo-events.ts: pure parser for the CLI JSON event stream (text + cost), tolerant of both part.* and flattened event shapes.
- cli-runner.ts: proxies a case to the container and parses the result.
- run.ts: chunks decider cases (10/chunk) into per-(model,chunk) queue messages; fetches a short-lived user token once per message; fails fast when benchmarkUserId is unset (plus a defensive per-case guard). Classifier path unchanged.
- New benchmarkUserId config field (nullable) on BenchmarkConfig.
- vitest aliases @cloudflare/containers to a node-safe stub so unit tests can import the worker entry without the cloudflare:workers chain.

Adds a Benchmark user id input to the benchmark config editor (empty -> null), with help text noting decider runs fail until it is set. Round-trips through configToFormState/formStateToConfig.

…retries

- accept step_finish (underscore) events so per-case cost is summed
- retry once when a CLI session ends with no assistant text
- exact checks also accept the last non-empty output line
- uniform final-answer suffix on decider prompts
- /admin/debug-cli endpoint returning raw CLI events for diagnosis

- serialize CLI runs per container and run decider cases sequentially (the CLI sqlite migration is unsafe under concurrent sessions)
- add dead-letter queue and raise container instance ceiling
- redact the kilo token from captured stderr before it leaves the container
- timing-safe secret comparison and tokenSource audit field on minted tokens
- validate persisted routing tables before serving them from the admin API
- regenerate worker types with the production web base URL
- dedupe the routing-table response schema; tier boundary tests

kilo-code-bot · 2026-06-11T23:17:51Z

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

All previously flagged issues are now resolved; the incremental commit correctly fixes the HTTP 400 precondition response for POST /admin/runs with no saved config.

Resolved Issues

File	Issue	Status
`services/auto-routing-benchmark/src/admin.ts`	`startRun` null-config throw returned HTTP 500 instead of 400 — user-facing precondition error misclassified as server fault	✅ Fixed (`165240b`)
`apps/web/src/lib/ai-gateway/auto-routing-mirror.ts`	Stale comment referencing `EfficientDecisionParams`	✅ Fixed (`a449c26`)
`services/auto-routing-benchmark/src/run.ts`	Redundant `getRunState` D1 round-trip	✅ Fixed (`82aef0b`)
`apps/web/src/app/api/internal/auto-routing-benchmark/token/route.ts`	Local `timingSafeStringEqual` — now consolidated into `@kilocode/encryption`	✅ Fixed
`services/auto-routing-benchmark/src/ttl-cache.ts`	Duplicated `ttlCached` utility — now promoted to `@kilocode/worker-utils`	✅ Fixed
`packages/auto-routing-contracts/src/benchmark.ts`	`ReasoningEffortSchema` duplicated — now canonical in `tiers.ts`	✅ Fixed
`services/auto-routing-benchmark/container/server.mjs:109`	`child.on('error')` and `child.on('close')` both calling `finish()` without a guard	✅ Fixed (`ba3b3be`)
`services/auto-routing-benchmark/src/index.ts:29`	`processJob` unhandled throw crashing the queue handler	✅ Fixed (`ba3b3be`)
`services/auto-routing/src/classifier-config.ts`	Missing `.catch()` guard on `kvReadThrough`	✅ Fixed (`01e4bd9`)
`services/auto-routing-benchmark/src/routing-table-builder.ts`	Null-cost summaries excluded from tier ranking	✅ Fixed (`71222ca`)
`services/auto-routing-benchmark/src/db-schema.ts`	Redundant `idx_case_results_run` index	✅ Fixed (`354054d`)
`packages/auto-routing-contracts/src/routing-table.ts`	`ClassifierApiKindSchema` validation	✅ Fixed
`services/auto-routing-benchmark/container/server.mjs`	Process-group kill on decider case timeout	✅ Fixed (`ae0cec5`)

Files Reviewed (incremental — 2 files)

services/auto-routing-benchmark/src/admin.ts — 0 issues (WARNING resolved)
services/auto-routing-benchmark/src/admin.test.ts — 0 issues

Previous Review Summary (commit 9eaae60)

Current summary above is authoritative. Previous snapshots are kept for context only.

Previous review (commit `9eaae60`)

Status: 1 Issue Found | Recommendation: Address before merge

Executive Summary

POST /admin/runs still surfaces a user-facing precondition error (no config saved) as HTTP 500 instead of 400; all incremental commits — Morph provider removal, GLM oneOf schema-logging removal, kimi-k2.7-code thinking-only variants, and test stub hardening — are clean.

Overview

Severity	Count
CRITICAL	0
WARNING	1
SUGGESTION	0

Issue Details

WARNING

File	Line	Issue
`services/auto-routing-benchmark/src/admin.ts`	44	`startRun` null-config throw returns HTTP 500 instead of 400 — user-facing precondition error misclassified as server fault

Resolved Issues (all fixed in prior commits)

File	Issue	Status
`apps/web/src/lib/ai-gateway/auto-routing-mirror.ts`	Stale comment referencing `EfficientDecisionParams`	✅ Fixed (`a449c26`)
`services/auto-routing-benchmark/src/run.ts`	Redundant `getRunState` D1 round-trip	✅ Fixed (`82aef0b`)
`apps/web/src/app/api/internal/auto-routing-benchmark/token/route.ts`	Local `timingSafeStringEqual` — now consolidated into `@kilocode/encryption`	✅ Fixed
`services/auto-routing-benchmark/src/ttl-cache.ts` (×2 copies)	Duplicated `ttlCached` utility — now promoted to `@kilocode/worker-utils`	✅ Fixed
`packages/auto-routing-contracts/src/benchmark.ts`	`ReasoningEffortSchema` duplicated in benchmark and index — now canonical in `tiers.ts`	✅ Fixed
`services/auto-routing-benchmark/container/server.mjs:109`	`child.on('error')` and `child.on('close')` both calling `finish()` without a guard	✅ Fixed (`ba3b3be`)
`services/auto-routing-benchmark/src/index.ts:29`	`processJob` unhandled throw crashing the queue handler	✅ Fixed (`ba3b3be`)
`services/auto-routing/src/classifier-config.ts`	Missing `.catch()` guard on `kvReadThrough` — benchmark failure could discard healthy admin override	✅ Fixed (`01e4bd9`)
`services/auto-routing-benchmark/src/routing-table-builder.ts`	Null-cost summaries excluded from tier ranking	✅ Fixed (`71222ca`)
`services/auto-routing-benchmark/src/db-schema.ts`	Redundant `idx_case_results_run` index — composite PK leftmost column already covers run_id prefix scans	✅ Fixed (`354054d`)
`packages/auto-routing-contracts/src/routing-table.ts`	`ClassifierApiKindSchema` and per-candidate `supportedApiKinds`	✅ Fixed — all-API-kinds validation now enforced at config save in `benchmark-config/route.ts`
`services/auto-routing-benchmark/container/server.mjs`	Process-group kill (detached + `killProcessTree` + `child.on('exit')` backstop)	✅ Fixed (`ae0cec5`)

Incremental Changes Reviewed (commits 1a5d858 → 9eaae60)

apps/web/src/lib/ai-gateway/providers/morph.ts — deleted. morph_warp_grep_free_model removed from kiloExclusiveModels in models.ts. 'morph' provider entry removed from provider-definitions.ts and types.ts. forbidden-free-models.ts correctly adds 'morph-warp-grep-v2' per the AGENTS.md rule for removed free models. The remaining 'morph' entries in inference-provider-id.ts are OpenRouter's third-party inference network identifiers — correct to leave. Clean.
apps/web/src/lib/ai-gateway/schema-logging.ts — deleted. GLM oneOf schema diagnostic logger removed (reverted from glm-logging feature). applyProviderSpecificLogic signature drops organizationId parameter, call site in route.ts updated accordingly. Clean.
apps/web/src/lib/ai-gateway/providers/model-settings.ts — extracts REASONING_VARIANTS_THINKING_ONLY (thinking-only, no instant variant) and adds a specific check for kimi-k2.7-code before the isKimiModel catch-all, correctly giving this model thinking-only variants. Ordering is correct. Clean.
apps/web/src/app/admin/api/auto-routing/benchmark-config/route.test.ts / model-api-kinds.test.ts / openrouter/index.test.ts — tests updated to replace morph_warp_grep_free_model with stable stub models, eliminating dependency on removed provider file. Clean.
apps/web/src/components/shared/ModelCombobox.tsx — visual tweak to free-model badge (ghost style with ring). UI-only. Clean.
apps/web/src/lib/ai-gateway/forbidden-free-models.ts — adds 'morph-warp-grep-v2' to the forbidden set. Clean.

Files Reviewed

services/auto-routing-benchmark/src/admin.ts — 1 issue (carried forward)
apps/web/src/lib/ai-gateway/providers/morph.ts (deleted) — 0 issues
apps/web/src/lib/ai-gateway/schema-logging.ts (deleted) — 0 issues
apps/web/src/lib/ai-gateway/providers/apply-provider-specific-logic.ts — 0 issues
apps/web/src/lib/ai-gateway/providers/model-settings.ts — 0 issues
apps/web/src/lib/ai-gateway/providers/provider-definitions.ts — 0 issues
apps/web/src/lib/ai-gateway/providers/types.ts — 0 issues
apps/web/src/lib/ai-gateway/models.ts — 0 issues
apps/web/src/lib/ai-gateway/forbidden-free-models.ts — 0 issues
apps/web/src/app/api/openrouter/[...path]/route.ts — 0 issues
apps/web/src/components/shared/ModelCombobox.tsx — 0 issues
apps/web/src/app/admin/api/auto-routing/benchmark-config/route.test.ts — 0 issues
apps/web/src/lib/ai-gateway/model-api-kinds.test.ts — 0 issues
apps/web/src/lib/ai-gateway/providers/openrouter/index.test.ts — 0 issues

Reviewed by claude-4.6-sonnet-20260217 · 415,474 tokens

Review guidance: REVIEW.md from base branch main

…unavailable

…anking

…ed config

…ine migration

… classifier dataset to per-pair coverage

…nomy coverage

Grow the decider benchmark from 30 to 76 cases so every (taskType, subtaskType) pair in the classifier taxonomy has at least 4 mechanically-checkable cases, with at least 20 cases per difficulty tier (23 low / 31 medium / 22 high).

- DeciderCase gains subtaskType; ids follow the <taskType>-<subtype>-<topic> scheme used by the classifier dataset
- Existing cases retagged with subtypes where they genuinely fit (three system-behavior investigation cases moved to planning_design/system_design, the HTTP 201 lookup to investigation/external_research, and the let-closure case reframed as refactoring/migration)
- New agentic_execution cases are self-contained file/terminal tasks deterministic in the node:22-slim container
- Tests now enforce per-pair and per-tier quotas from the classifierTaxonomy export, subtype/taskType consistency, regex compilability, and json_equal round-tripping

Remember the last served model per conversation in the decision-cache DO and keep it while it meets the current tier's accuracy threshold, unless the fresh pick is cheaper by more than the routing table's new switchCostFactor. Switching models discards provider prompt caches, so a session whose difficulty tier oscillates no longer ping-pongs between models. Decisions report a sticky flag in the response and the auto_routing_decision log line.

…runs, and routing table

Store the new BenchmarkConfig.switchCostFactor in the benchmark_config singleton, snapshot it into benchmark_runs at startRun, and carry the run's snapshotted value into published routing tables so the schema's required RoutingTableSchema.switchCostFactor parses on read. Regenerate the squashed D1 baseline migration, add a Switch cost factor field to the admin config form, and update test fixtures (including the apps/web decision fixtures missing the new required sticky flag).

…icient-decision-engine

…r main merge

…e at config save

All decider candidates are served via providers that speak every gateway chat API (in practice OpenRouter), so per-candidate supportedApiKinds was dead weight in the contracts, decision engine, D1 schema, and routing table. The one real failure mode, an admin configuring a model whose serving provider is chat-completions-only, is now rejected at config save time instead.

- never let a heuristic fallback classification re-anchor the session's sticky model (same trust rule as the classification cache)
- drop the dead ClassifierApiKindSchema export
- rename the decider pages-helper case so its id no longer collides with the classifier dataset's debug-fix-pagination-slice in shared telemetry
- trim a stale JSDoc in model-api-kinds.ts

- Inject KILO_API_URL into the benchmark container via a new KILO_CLI_API_URL worker var so the kilo CLI targets the same gateway the worker mints tokens against (prod default: api.kilo.ai).
- Add .dev.vars.example mapping both URLs to the local apps/web dev server (worker-side localhost, container-side host.docker.internal).
- Add AUTO_ROUTING_BENCHMARK_WORKER_URL to the apps/web env example so the admin panel proxies to the local benchmark worker instead of prod.
- Work around wrangler force-pulling the amd64 container egress proxy on Apple Silicon (its transparent-proxy setsockopt crashes under emulation, failing every local container start) by pinning the arm64 manifest digest via MINIFLARE_CONTAINER_EGRESS_IMAGE in the dev runner.

…meout

The kilo bin is a Node wrapper that spawns the real CLI binary as a grandchild. SIGKILLing only the wrapper orphaned the grandchild on timeout: it kept running (and spending) and held the stdout/stderr pipes open, so 'close' never fired, the case promise never resolved, and the chunk's queue message hung until the runtime cut it, then retried from case 0 and eventually dead-lettered. Observed live: a runaway agentic case ran 20+ minutes past the 180s cap and wedged the whole run.

Spawn the CLI detached so it leads its own process group, kill the group on timeout, and add an after-exit grace backstop so a stray pipe-holder can never hang a case again.

…r latency gate

- Config gains classifierRepetitions, deciderRepetitions (1-5), and classifierMaxP95LatencyMs (null = no constraint); run rows snapshot the active repetition count and latency budget at start time.
- case_results PK extended with rep column; timed_out column added.
- model_summaries gains p95_latency_ms (nearest-rank p95 over all rows) and timeouts count.
- pickClassifierWinner enforces an optional p95 latency budget: candidates meeting both accuracy and latency are ranked by cost; when none meet the budget, falls back to lowest-p95 among accuracy-meeting models.
- classifier_winner contract surfaces the winner's p95LatencyMs.
- DECIDER_CHUNK_SIZE reduced from 10 to 5 to stay well within queue consumer wall-clock limits.
- Container server propagates timedOut flag through ContainerRunResponse and CliRunResult so timed-out cases are recorded in D1.
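A sketch of the nearest-rank p95 and the latency-gated winner rule from this commit message. The `LatencyCandidate` shape and the `sketchLatencyGatedWinner` name are hypothetical, `>=` and `<=` comparisons are assumptions, and the case where no model meets the accuracy threshold is left out (it returns null here).

```typescript
// Nearest-rank percentile: the value at rank ceil(0.95 * n) in sorted order.
// Integer arithmetic avoids floating-point drift in the rank.
function nearestRankP95(latenciesMs: number[]): number | null {
  if (latenciesMs.length === 0) return null;
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100);
  return sorted[rank - 1] ?? null;
}

// Hypothetical per-model aggregate for a classifier run.
interface LatencyCandidate {
  model: string;
  accuracy: number;
  costUsd: number;
  p95LatencyMs: number;
}

function sketchLatencyGatedWinner(
  candidates: LatencyCandidate[],
  minAccuracy: number,
  maxP95LatencyMs: number | null, // null = no latency constraint
): LatencyCandidate | null {
  const accurate = candidates.filter((c) => c.accuracy >= minAccuracy);
  if (accurate.length === 0) return null; // accuracy fallback not sketched here
  const withinBudget =
    maxP95LatencyMs === null
      ? accurate
      : accurate.filter((c) => c.p95LatencyMs <= maxP95LatencyMs);
  if (withinBudget.length > 0) {
    // Meets accuracy and latency: cheapest wins.
    return withinBudget.reduce((best, c) => (c.costUsd < best.costUsd ? c : best));
  }
  // Nobody meets the budget: lowest p95 among accuracy-meeting models.
  return accurate.reduce((best, c) => (c.p95LatencyMs < best.p95LatencyMs ? c : best));
}
```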

…test gaps

- Migration 0001: replace "rep"/"timed_out" column refs in INSERT...SELECT with literal 0,0. The old table lacks those columns; D1 silently degrades double-quoted unknowns to string literals, corrupting NOT NULL integer rows.
- Contracts: add BenchmarkConfigSchema defaults test (classifierRepetitions=1, deciderRepetitions=1, classifierMaxP95LatencyMs=1000 when omitted).
- Benchmark: extract buildDeciderMessages() pure function; add fan-out test asserting models × reps × ceil(76/5) messages each carrying the correct rep.

…olumns

Add classifier/decider repetitions (1–5) and classifierMaxP95LatencyMs inputs to the Benchmark Config card; add p95 latency and Timeouts columns to the run summaries table; update test fixtures with new fields.

Set both RunSummariesTable colSpan values back to 6 to match the outer BenchmarkRunsTable's 6-column header (chevron, Kind, Status, Started, Completed, Error). Export configToFormState and formStateToConfig for unit testing and add focused tests covering null-config defaults, round-trip preservation of repetitions/latency fields, and empty-string classifierMaxP95LatencyMs coercing to null.

…icient-decision-engine

…ests

Main merged PR #4004 which deleted the morph provider. The two test files that exercised the rejection branch of modelServesAllGatewayChatApis used morph as the only available Kilo-exclusive model on a chat_completions-only gateway. With morph gone, no real catalog entry satisfies that condition. Both test files now stub findKiloExclusiveModel via jest.mock/requireActual so that the marker id 'test-exclusive/alibaba-only' returns a KiloExclusiveModel with gateway: 'alibaba'. The real PROVIDERS.ALIBABA definition supports only chat_completions, so the rejection path is exercised without relying on any specific provider file being present in the catalog.

…onfig

The POST /admin/runs handler let startRun's "config not set" precondition error propagate to the global error handler, surfacing a client-side precondition as HTTP 500. Guard the null config in the route handler, mirroring the /admin/debug-cli pattern, and return 400 instead.

iscekic added 23 commits June 11, 2026 21:58

refactor(auto-routing): move classifier core into contracts package

b41e58e

feat(auto-routing): add tier, routing-table, decision and benchmark c…

1fb85f5

…ontracts

feat(auto-routing): add benchmark-driven decision engine and KV routi…

39acfdb

…ng table

feat(auto-routing): return routing decisions from /decide

bd83fdc

fix(auto-routing): log unparseable routing table JSON before falling …

9621d62

…back

feat(auto-routing-benchmark): scaffold benchmark worker with D1 schema

7af1b6d

feat(auto-routing-benchmark): classifier golden dataset and grading

22de713

style(auto-routing-benchmark): apply oxfmt formatting

878e49b

feat(auto-routing-benchmark): decider golden dataset with deterministic checkers

662717c

fix(auto-routing-benchmark): unambiguous whitespace instruction in off-by-one case

110cbd9

feat(auto-routing-benchmark): queue-driven benchmark runs with aggregation and table publish

5ce8621

feat(auto-routing-benchmark): admin config, runs and routing-table endpoints

0c763ce

feat(admin): proxy routes for auto-routing benchmark service

c749be2

feat(admin): benchmark config, runs and routing table panel

0e34c02

fix(admin): stabilize benchmark runs polling interval dependencies

fb084c3

feat(admin): benchmark user id config field

d0f13b0

Adds a Benchmark user id input to the benchmark config editor (empty -> null), with help text noting decider runs fail until it is set. Round-trips through configToFormState/formStateToConfig.

feat(gateway): add kilo-auto/efficient with blocking auto-routing decisions

fdc6520

chore(auto-routing): drop unused import in routing-table contracts

813ea0e

fix(auto-routing-benchmark): warm up CLI container before concurrent decider cases

5ff4b08

fix(auto-routing-benchmark): faster container turnover to avoid instance exhaustion

06836cc

iscekic self-assigned this Jun 11, 2026

iscekic marked this pull request as ready for review June 11, 2026 23:12

style(auto-routing-benchmark): format wrangler.jsonc

cac57b7

iscekic force-pushed the feat/auto-routing-efficient-decision-engine branch from ca99949 to cac57b7 June 11, 2026 23:12

kilo-code-bot Bot reviewed Jun 11, 2026

Comment thread services/auto-routing-benchmark/container/server.mjs

Comment thread services/auto-routing-benchmark/src/index.ts

iscekic added 5 commits June 12, 2026 16:26

fix(auto-routing): keep classifier override when benchmark origin is unavailable

01e4bd9

docs(contracts): fix stale classifier-winner comment

0828e47

fix(benchmark): exclude no-cost-signal summaries from routing table ranking

71222ca

test(benchmark): fix expected ranking order in no-cost-signal test

6f5fd38

feat(benchmark): remove fabricated default config; runs require a saved config

2cd53f9

kilo-code-bot Bot reviewed Jun 12, 2026

Comment thread services/auto-routing-benchmark/src/admin.ts

iscekic added 11 commits June 12, 2026 17:01

chore(benchmark): drop redundant case_results index, regenerate baseline migration

354054d

docs(benchmark): fix stale KV comment in wrangler config

6aba145

feat(auto-routing-benchmark): grade subtaskType and riskLevel, expand classifier dataset to per-pair coverage

8955269

Merge remote-tracking branch 'origin/main' into feat/auto-routing-efficient-decision-engine

1d424c5

fix(ai-gateway): align efficient fallback with Qwen-for-all-APIs after main merge

3d50441

test(ai-gateway): add sticky field to decision fixture

053373b

iscekic requested review from jeanduplessis and pandemicsyn June 12, 2026 19:02

iscekic added 11 commits June 12, 2026 21:06

feat(dev): move auto-routing workers into their own opt-in dev group

b8a5892

chore(auto-routing): squash benchmark D1 migrations into one baseline

1a5d858

Merge remote-tracking branch 'origin/main' into feat/auto-routing-efficient-decision-engine

c9db589

iscekic wants to merge 75 commits into main from feat/auto-routing-efficient-decision-engine


iscekic commented Jun 11, 2026 (edited)

Benchmark-driven decision engine and kilo-auto/efficient

Summary

Architecture

Benchmark worker (services/auto-routing-benchmark)

Classifier benchmark

Decider benchmark

Datasets

D1 schema (src/db-schema.ts)

Publishing

Decision engine (services/auto-routing)

Gateway (apps/web)

Design properties

Infrastructure

Post-merge deploy / cutover checklist

Reviewer notes

kilo-code-bot Bot commented Jun 11, 2026 (edited)

Code Review Summary

Executive Summary

Previous review (commit 9eaae60)

Executive Summary

Overview

WARNING
